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Abstract 

Background: Over the past years, reports have indicated that honey bee populations are declining and that infestation 
by an ecto-parasitic mite {Vorroa destructor) is one of the main causes. Selective breeding of resistant bees can help to 
prevent losses due to the parasite, but it requires that a robust breeding program and genetic evaluation are 
implemented. Genomic selection has emerged as an important tool in animal breeding programs and simulation 
studies have shown that it yields more accurate breeding value estimates, higher genetic gain and low rates of 
inbreeding. Since genomic selection relies on marker data, simulations conducted on a genomic dataset are a 
pre-requisite before selection can be implemented. Although genomic datasets have been simulated in other species 
undergoing genetic evaluation, simulation of a genomic dataset specific to the honey bee is required since this species 
has a distinct genetic and reproductive biology. Our software program was aimed at constructing a base population by 
simulating a random mating honey bee population. A forward-time population simulation approach was applied since 
it allows modeling of genetic characteristics and reproductive behavior specific to the honey bee. 

Results: Our software program yielded a genomic dataset for a base population in linkage disequilibrium. In addition, 
information was obtained on (1) the position of markers on each chromosome, (2) allele frequency, (3) x 2 statistics for 
Hardy-Weinberg equilibrium, (4) a sorted list of markers with a minor allele frequency less than or equal to the input value, 
(5) average r 2 values of linkage disequilibrium between all simulated marker loci pair for all generations and (6) average r 2 
value of linkage disequilibrium in the last generation for selected markers with the highest minor allele frequency. 

Conclusion: We developed a software program that takes into account the genetic and reproductive biology specific to 
the honey bee and that can be used to constitute a genomic dataset compatible with the simulation studies necessary to 
optimize breeding programs. The source code together with an instruction file is freely accessible at http://msproteomics. 
org/Research/Misc/honeybeepopulationsimulator.html 



Background 

The honey bee is an economically important species that 
serves as a major pollinator of wild plants and agricultural 
crops. Over the past years, a decline of honey bee popula- 
tions has been reported [1-3] mainly caused by infestation 
with an ecto-parasitic mite (Varroa destructor). Selective 
breeding of resistant bees can help prevent losses due to the 
parasite, but it requires that a robust breeding program and 
genetic evaluation are implemented. Until now, the ap- 
proach used in honey bee is a traditional breeding program 
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based on the best linear unbiased prediction (BLUP) relying 
on pedigree data [4]. Recently, genomic selection strategies 
based on molecular marker data have emerged as a promis- 
ing approach [5]. Marker-based selection has been widely 
tested in several species either with simulated datasets e.g. 
[6,7] or with real datasets e.g. [8-11] but, to date, not in 
honey bee. Since simulation studies require molecular gen- 
etic and pedigree datasets to ascertain selection methods, 
our primary aim was to develop a software program capable 
of producing a dataset for a base population in the honey 
bee and which could be used to implement marker-based 
selection procedures. A base population is used as a starting 
point for simulation studies. It is composed of individuals in 
a pedigree for which no ancestral information is available, 
and is assumed to be in linkage disequilibrium (LD). The 
second goal of our software program was to provide the 
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user with the possibility to investigate the effect of para- 
meters like mutation rate, density of markers and number 
of individuals on the extent of LD in a population. 

Implementation 

Honey bee has specific genetic and reproductive characteris- 
tics, such as a high recombination rate, haplo-diploid sex de- 
termination, arrhenotoky and polyandry, which require 
appropriate software for modeling. The features of our soft- 
ware program and the modeling of population structure, 
genome and evolutionary processes are described below. In 
order to simulate a dataset according to the requirements 
specific to honey bee populations, it is necessary to design a 
software program that will allow to input: (1) number of gen- 
erations, (2) number of sire queens, (3) number of dam 
queens, (4) number of marker loci, (5) forward and backward 
mutation rates, (6) minor allele frequencies and (7) number 
of marker loci to be selected as SNP (single nucleotide poly- 
morphism) on the basis of minor allele frequency. The 
MATLAB source code and an instruction file on how to use 
the software program can be found on its website [12]. 

Setting up of the population structure 

A population can be structured according to the provided 
input. The input data include number of sire queens and 
dam queens (with a ratio of 10:1), number of generations 
and total number of marker loci to be simulated (assumed 
to be bi-allelic). In order to model the haploid drones, a 
sire queen is defined as representing a drone-producing 
colony. Two matrices with a size equal to number of indi- 
viduals by number of marker loci represent the genome of 
diploid sire queens and dam queens, respectively. The 
population size is kept constant in every generation 
according to the Fisher- Wright population model. Further- 
more, all simulated generations are non-overlapping. 

Genome simulation 

A diploid genome, consisting of 16 linkage groups, is 
simulated for sire and dam queens. The length of each 
chromosome is simulated according to the actual length 
of all honey bee chromosomes. The number of marker 
loci (N) to be simulated along the genome can be pro- 
vided as input. In the software program, the number of 
marker loci to be distributed per chromosome, N{ 

(i= 1,2, 16), is based on the actual proportion of SNP 

loci present on each honey bee chromosome and is com- 
puted using the following formula: 

Number of marker loci on the i th chromosome, A/; = NRi 

where R t is equal to the actual ratio between number of 
SNP loci on the i th chromosome and total number of SNP 
loci in the honey bee genome. Positions of all the loci on all 
chromosomes are sampled from a uniform distribution. 



The number of SNP loci per chromosome and the length 
of all 16 chromosomes were obtained from the honey bee 
genome database [13,14] as shown in Table 1. 

Evolution simulation 

To simulate an evolutionary process, recombination and 
mutation are implemented during the process of gamete 
formation in every generation. Multiple mating is modeled 
in the parental generations. The processes are briefly 
described below. 

Recombination 

Recombination is the exchange of chromosomal segments 
between paternal and maternal chromosomes and is imple- 
mented as follows. The recombination probability (6) be- 
tween two adjacent loci on a chromosome is calculated 
from the Haldane mapping function [15], which is the most 
widely used mapping function. It is based on the assump- 
tion that crossovers in any given chromosomal segment fol- 
low a Poisson distribution, with no interference between 
crossovers. In the software program, the recombination 
probability is calculated using the following expression: 

9 = i[l-exp(-2|*|)] 

where exp denotes the exponential function and \x\ stands 
for the absolute value of the map distance between adjacent 

Table 1 Summary of the simulated chromosome length, 
number of SNP and Rj 



Chromosome Length (in base-pairs) Number of SNP Rj 



1 


29893408 


140148 


0.1414 


2 


15549267 


62801 


0.0633 


3 


13234341 


70577 


0.0712 


4 


12718334 


55407 


0.0559 


5 


14363272 


62750 


0.0633 


6 


18472937 


78086 


0.0788 


7 


13219345 


59210 


0.0597 


8 


13546544 


61811 


0.0623 


9 


11120453 


55302 


0.0558 


10 


12965953 


50243 


0.0507 


11 


14726556 


68972 


0.0696 


12 


11902654 


57616 


0.0581 


13 


10288499 


50380 


0.0508 


14 


10253655 


48322 


0.0487 


15 


10167229 


38452 


0.0388 


16 


7207165 


31295 


0.0316 


Total 


219629612 


991372 





Chromosome length and SNP data were obtained from the honey bee 
genome database; Rj is the actual ratio between the number of SNP on the i th 
chromosome and the total number of SNP in the honey bee genome; this 
information is used to simulate the genome and distribute markers across the 
chromosomes. 
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loci. The Haldane mapping function requires that distances 
are expressed in Morgan units, therefore, distances between 
adjacent loci are converted from base-pairs to Morgan 
using the reported recombination rate of 19 cM/Mb 
[16,17]. 

Mutation 

Mutation is implemented in our software program to cre- 
ate polymorphisms. All loci in all individuals belonging to 
generation zero have a single allele coded as 1. We model 
both forward and backward mutations, allowing each 
locus to mutate from allele 1 to allele 2 and from allele 2 
to allele 1. The required rates of forward and backward 
mutations can be specified in the input as mutation rates 
per locus per gamete per generation. The advantage of 
modeling a bi-directional mutation is that different values 
of forward and backward mutation rates can be chosen. 
Setting the backward mutation rate to zero will result in 
an infinite site model of mutation [18] where each locus 
can only mutate once in the entire generation and muta- 
tion will result in the formation of allele 2. The infinite site 
model of mutation can be useful when simulating an ex- 
tremely high initial marker density with a low mutation 
rate and larger number of generations as shown in studies 
by Sorrenson and Meuwissen [6] and Calus et al. [19], 
where 1 million and 300 000 marker loci were simulated 
for a genome of size 10 M and 3 M, respectively. 

Multiple mating 

Figure 1 describes the general mating scheme followed 
during the simulation. Polyandry, commonly referred to as 



multiple mating, is a phenomenon observed in honey bee 
whereby a queen mates with multiple drones (average of 
10 to 20 drones). To model this situation in our software 
program, a dam queen and a group of 10 sire queens (a 
sire queen represents a drone-producing colony) are 
assumed as mating partners. To form groups, all sire 
queens are randomly permuted and thereafter divided into 
groups consisting of 10 sire queens. 

A detailed mating scheme, showing how gametes from 
a dam queen are combined with the drones from a group 
of sire queens is illustrated in Figure 2. A dam queen 
generates a total of 11 gametes, of which 10 give rise to 
sire queens and one a dam queen in the next generation. 
Since a gamete produced by a sire queen is regarded as a 
drone, it is assumed to occur in multiple copies. One of 
the 10 sire queens of a group contributes a drone, which 
combines with a gamete from the dam queen to produce 
a new dam queen for the next generation. In addition to 
the drone generated for the formation of a dam queen, 
each of the 10 sire queens of a group produces one 
drone, thus a group contributes a total of 11 drones. 
During the formation of a sire queen, all 11 drones of a 
group have an equal probability to be drawn as a gamete. 
Since drones in a set are sampled with replacement, the 
resulting progenies are related as super-sibs (coefficient 
of relatedness = 0.75), full-sibs (coefficient of related- 
ness = 0.5) or half-sibs (coefficient of relatedness = 0.25). 

Statistics 

The software program provides statistics for allele fre- 
quency, Hardy- Weinberg equilibrium, minor allele 
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Figure 1 General mating scheme. m = total number of dam queens; since there is a 1:10 ratio between number of dams and sires, the number 
of sire queens is "mx10"; in every generation, all sire queens are randomly permuted and grouped; each group consists of 10 sire queens; a dam 
queen and a group of sire queens are the mating partners; all generations are non-overlapping and the population size is kept constant across 
generations. 
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Figure 2 Multiple mating between a dam queen and 11 drones from a group. The resulting offspring consist of one dam queen and 10 sire 
queens; all drones are sampled with replacement which models the phenomenon of producing multiple copies of identical gametes by a drone. 



frequency and LD, which allow inferences to be made about 
an evolving population. Allele frequency data is a requisite 
for any population; hence the software program creates an 
output with allele frequencies for all loci in the base popula- 
tion along with the output for Hardy- Weinberg equilibrium 
and minor allele frequency, which are the usual criteria to 
evaluate marker loci informativeness. Most studies based 
on genome-wide marker data rely on the assumption that a 
marker and the locus affecting the trait are in LD, therefore, 
in the last generation, the average LD value is calculated for 
selected SNP with the highest minor allele frequency. In 
addition, generation-wise LD values for all simulated mar- 
ker loci pairs are calculated and plotted on a graph. As a 
measure of LD, we used r 2 [20,21] which was calculated as 
follows: 



where D =f(A 1 B 1 )f(A 2 B 2 ) -f(A 1 B 2 )f(A 2 B 1 ) and /(AA), 
f{A 2 B 2 ), /(AA), f(A 2 B x \ /(Ax), f{A 2 ), f{B x \ f(B 2 ) are 
observed frequencies of haplotypes A{Bi, A 2 B 2 , A]B 2 , A 2 B! 
and of alleles A 1? A 2 , B 2 respectively in the population. 

Results and discussion 

The software program developed in this study produces 
the following output: (1) the position of marker loci in 
base-pairs on each chromosome, (2) genotypes of all 
simulated individuals in the last generation, which can 
be used as a base population in further studies, (3) aver- 
age r 2 values of LD for all simulated marker loci pair for 
every simulated generation as a plot as well as a text file, 



(4) average r 2 value of LD for the selected SNP with the 
highest minor allele frequency in the last generation, (5) 
allele frequencies for all loci in the base generation, (6) 
X 2 statistics for Hardy- Weinberg equilibrium for all loci 
in the base generation and (7) a list of marker loci with a 
minor allele frequency smaller than or equal to the input 
value. 

Example 

To provide an illustration, we performed a run as ex- 
ample with the following input parameters: 

- Number of generations: 2000 

- Number of sire and dam queens: two population 
sizes were simulated; the first one consisted of 500 
sire and 50 dam queens and the second one of 200 
sire and 20 dam queens. 

- Total number of marker loci: 100 000 

- Number of marker loci to be selected as SNP: 44 000 

- Forward and backward mutation rates: 0.0025 

- Minor allele frequency: 0.05 

The values of mutation rate, recombination rate, marker 
density and effective population size determine both the 
level of LD in a population and the number of generations 
required to reach a certain value of LD. For the dataset ex- 
ample, the chosen mutation rate was set at 0.0025, a value 
similar to that used by Meuwissen et al. [5], allowing a 
high probability of polymorphic marker loci. Information 
on the level of recombination and on the effective popula- 
tion size (N e ) in honey bee was obtained from Beye et al. 
[17] and Estoup et al. [22], respectively. 
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All data and result files for this example can be found at 
the project home page [12]. In this paper, we only present 
the output graph (Figure 3) showing the establishment of 
LD for the simulated dataset with 220 queens. The graph 
shows the average LD reached in each generation and was 
obtained by calculating the average of LD value between 
all simulated pair of marker loci across the genome, not 
preselected on the basis of minor allele frequency. 

In addition, we compared the expected average LD to 
the achieved average LD for 44 000 loci with the highest 
minor allele frequencies, which were chosen as SNP. The 
expected average LD in a population was calculated as 
follows [21]: 

7 5 + 2N e c 



ll + 267V e c + 8Af 2 c 2 



where c is the recombination fraction between adjacent 
loci and N e is the effective population size. Since the 
total size of the simulated genome was 219 629 612 
base-pairs (Table 1) and the approximate recombination 
rate was taken as 19 cM/Mb [16,17], the size of the 
simulated genome was 41.73 M. Thus, for a genome of 
41.73 M, c was approximately 0.001 for 44 000 SNP. The 
honey bee population has a wide range of effective popu- 
lation sizes [22]; therefore, we simulated two scenarios, 
one with 220 queens and other with 550 queens, respect- 
ively. The effective population size in the honey bee was 
calculated using the following expression for a haplo- 
diploid population [23,24]: 



9N f N m 



2(2N m +N f ) 



where Nf is the number of queens (which is equal to the 
number of colonies since each colony is headed by a single 
queen) and N m is the number of males. In the simulation, 
we assume that each queen is inseminated by 11 drones, 



therefore N m = llNf. Thus, with 220 and 550 colonies, N e 
was approximately 473 and 1184, respectively. With N e = 
473, the expected LD was 0.24 and the achieved LD was 
0.23. Similarly, with A/^1184, the expected and achieved 
LD were equal to 0.14 and 0.11, respectively. These values 
show that our software program is able to model the honey 
bee population with good accuracy. 

Creating a dataset for a base population is a prerequis- 
ite for any simulation study, but it can be time consum- 
ing and it requires testing of optimum parameters. In 
most of the available software used to simulate popula- 
tions [25], base population simulation is the preliminary 
stage, and is done by allowing the population to evolve 
through a burn-in period till the population reaches equi- 
librium from a random or uniform initial state. In our 
software program, all individuals in the starting gener- 
ation are assumed to be unrelated. To establish LD, ran- 
dom mating is performed for the required number of 
generations and the final generation, which is in mutation- 
drift equilibrium, is taken as the base population. In gen- 
omic selection studies, a base population is the common 
starting point from which a population evolves further 
according to specific study requirements. To the best of 
our knowledge, this is the first software program that deals 
with evolutionary aspects in honey bee. It aims at provid- 
ing an impetus to simulation studies in honey bee. It is an 
important initiative and we are developing strategies to 
simulate other honey bee datasets that could be used to 
implement genomic selection and furthermore to extend 
the study with a real genotyping dataset obtained from the 
SNP assay developed for honey bee [26]. The code is writ- 
ten in MATLAB but can be easily adapted to the open 
source version Octave. 

Conclusions 

Our software program can construct a base population 
in LD by simulating a random mating honey bee 




600 800 1000 1200 1400 1600 1800 2000 
Number of generations 



Figure 3 The average value of r 2 plotted against number of generations for the input parameter values. Simulation was performed for 
2000 generations with a forward and backward mutation rate of 0.0025 for 100 000 marker loci and 220 colonies (20 dam queens and 200 sire 
queens); with the parameter values chosen here, a stable LD is reached after random mating. 
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population given some input population parameters. The 
statistics relevant to a population such as allele frequency, 
r 2 value for LD and data for marker sorting according to 
minor allele frequency and Hardy- Weinberg equilibrium 
are provided in output files. The software program is rele- 
vant for research requiring a simulated molecular genetic 
honey bee dataset such as studies aiming at optimizing 
honey bee breeding programs. 

Availability and requirements 

Project name: Honey Bee Population Simulator 

Project home page: http://msproteomics.org/Research/ 

Misc/honeybeepopulationsimulator.html 

Operating system(s): Platform independent 

Programming language: MATLAB 

Other requirements: Tested for MATLAB version 7.9.0.529 
(R2009b) and higher 

License: The source code is available free of charge 
Any restrictions to use by non-academics: none 

Abbreviations 

BLUP: Best Linear Unbiased Prediction; LD: Linkage Disequilibrium; 
SNP: Single Nucleotide Polymorphism. 
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